Import all libraries needed in the Homework
In [1]:
import pandas as pd
from IPython.display import display, HTML
import os
import numpy as np
import json
import folium
import copy
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from IPython.display import display, Image
%matplotlib inline
Go to the eurostat website and try to find a dataset that includes the european unemployment rates at a recent date.
Use this data to build a Choropleth map which shows the unemployment rate in Europe at a country level. Think about the colors you use, how you decided to split the intervals into data classes or which interactions you could add in order to make the visualization intuitive and expressive. Compare Switzerland's unemployment rate to that of the rest of Europe.
We downloaded the set : Unemployment rates by sex, age and nationality (%) which is a excel file(.xls)
In [2]:
# Import the excel file and call it xls_file
xls_file = pd.ExcelFile('Data/lfsa_urgan.xls')
initial_df = xls_file.parse()
#display the dataframe created above
display(initial_df.head())
We see that there are a lots of NaN because the dataframe represents exactly the cells of the excel files. And inside the file lots of cells are empty. We also observe in the file that there are 3 tables. One for the total of the population, One only for the Males and the last only for females. Therefore now we have to extract the table with the unemployement rate of the total population for 2016 for All the european countries.
To do so we :
- Extract the lines 17 to 50 which represents the unemployemend rate for the Total of the population for every country
- keep only the most recents data which are in the last columns of the dataframe (i.e : 2016)
In [3]:
#Rename the useful columns
df = initial_df.copy()
df.rename(columns={'Unemployment rates by sex, age and nationality (%) [lfsa_urgan]':'Country'}, inplace=True)
df.rename(columns={'Unnamed: 10':'2016 Unemployement Rate'}, inplace=True)
#Keep the useful culumns
df = df[['Country','2016 Unemployement Rate']]
#Keep the lines representing individual countries between the 17th and the 49th
df_european_unemployement = df[df.Country=='European Union (28 countries)']
df_final = df[17:50].copy()
# Clean the name of the countries
df_final.Country.replace('Germany (until 1990 former territory of the FRG)','Germany',inplace=True)
df_final.Country.replace('Former Yugoslav Republic of Macedonia, the','The former Yugoslav Republic of Macedonia',inplace=True)
#display the head of the dataframe
display(df_final.head())
To make it possible to visualize the countries that we do not have data on. We will create a cleaned toposjon that only contain the matched countries.
In [4]:
#Open the topojson for europe
with open('topojson/europe.topojson.json') as f:
topo_json = json.load(f)
#create an array
topo_countries = np.zeros(len(topo_json['objects']['europe']['geometries']), dtype=object)
# Upload the array with the informations inside topojson
for i, id_nr in enumerate(topo_json['objects']['europe']['geometries']):
topo_countries[i] = id_nr['properties']['NAME']
#sort the array topo_country
topo_countries = sorted(topo_countries)
#print th countries in topojson europe
print("Number of countries in 'topojson/europe.topojson.json':", len(topo_countries))
print(topo_countries)
#print the countries in the datas with unemployement rate
unemploy_countries = sorted(df_final['Country'])
print("\nNumber of countries in Unemployed rate data:", len(unemploy_countries))
print(unemploy_countries)
Comment:
We display all the countries which we do not have data on.
Then, we compute a topojson without the irrelevant countries displayed
In [5]:
# Create a list where we will add all the country that are inside the dataFrame and inside the europetopojson
matching_countries = []
for topo_country in topo_countries:
for unemploy_country in unemploy_countries:
if unemploy_country == topo_country:
matching_countries.append(topo_country)
#display all the countries that we will delete from the topojson and countries in our data not matching with the europetopojson
no_match_topo_countries = np.setdiff1d(ar1=topo_countries, ar2=matching_countries)
no_match_unemploy_countries = np.setdiff1d(ar1=unemploy_countries, ar2=matching_countries)
print("Countries we are about to delete from our topojson:\n", no_match_topo_countries)
print("\nCountries not matched from Unemployed rate data:\n", len(no_match_unemploy_countries))
# load once again europe.topojson
with open('topojson/europe.topojson.json') as f:
eur_topojson = json.load(f)
cleaned_eur_topojson = eur_topojson
# we keep only the relevant data in our cleand eur topojson
cleaned_eur_topojson['objects']['europe']["geometries"] = [x for x in cleaned_eur_topojson['objects']['europe']["geometries"]
if x["properties"]["NAME"] not in no_match_topo_countries]
Now we have a dataframe with the county and their unemployement rate of 2016 and a topojson with ONLY the countries with datas. We will use this datas to build a Choropleth map which shows the unemployment rate in Europe at a country level. To do so we will implement a few things :
All the countries with transparent layers are countries that we did not find unemployement rate. We deleted them from the topojson just above
In [6]:
def quantile_thresholds(df, num_q):
"""
This function takes in a df and number of desired quantiles. This is a way of making discrete categories of the
data so that the probability of beeing in each category is approximately the same.
It returns a list of the thresholds.
These thresholds can be used to construct the categories in the choropleth map.
"""
q_thresholds = []
# Create different thresholds using the DataFrame.quantile() function
for x in range(num_q):
q_thresholds.append(df.quantile(x/(num_q)))
q_thresholds.append(np.floor(max(df) + 1))
q_thresholds = np.array(q_thresholds).astype('int')
return list(q_thresholds)
In [7]:
#load json data on all of the european countries
country_geo_path = 'topojson/europe.topojson.json'
geo_json_data = open(country_geo_path)
#countries that we have unemployed data on
cleaned_topojson = cleaned_eur_topojson
#build a blank map of the europe
m_europe = folium.Map([48,8], tiles='cartodbpositron', zoom_start=2.6)
#create a layer of borders from all the countries in europe
folium.TopoJson(geo_json_data,
'objects.europe',
name='nodata_json',
style_function=lambda feature: {
'fillColor': 'transparent',
'color' : 'black',
'weight' : 2,
'opacity' : 0.8
}
).add_to(m_europe)
#creating a layer of 'Quantile distribution of countries in categories'
m_europe.choropleth(geo_data=cleaned_topojson,
name='Linear thresholds',
data=df_final,
columns=['Country','2016 Unemployement Rate'],
key_on='feature.properties.NAME',topojson='objects.europe',
fill_color='BuPu', fill_opacity=0.9, line_opacity=0.2,
legend_name='Linear coloring: Percentage of unemployement rate (%)')
#creating a list of thresholds for the colorcoding of the quantile-choropleth
threshold_scale = quantile_thresholds(df_final['2016 Unemployement Rate'], 5)
#creating a layer of 'Quantile distribution of countries in categories'
m_europe.choropleth(geo_data=cleaned_topojson,
name='Quantile distribution of countries in categories',
data=df_final,
columns=['Country','2016 Unemployement Rate'],
key_on='feature.properties.NAME',topojson='objects.europe',
fill_color='BuPu', fill_opacity=0.9, line_opacity=0.2,
threshold_scale = threshold_scale,
legend_name='Quantile coloring: Percentage of unemployement rate (%)')
#create a symbol that make you able to turn ON and OFF the different layers
folium.LayerControl().add_to(m_europe)
#display the map
m_europe
Out[7]:
There is a white icon with "layers" in the upright corner of the map. Press this map to choose which layers to activate. The colors do NOT represent the same interval of values it is only two different ways to visualize RELATIVE difference in unemployment between countries in Europe. When analysing map KEEP ONLY ONE LAYER on at the same time!
There are 3 layers:
In [8]:
# We create a dataframe call compare with the data for switzerland and the mean for the rest of the europe
frames = [df_final[df_final.Country == 'Switzerland'],df_european_unemployement.iloc[[0]]]
result = pd.concat(frames)
display(result)
We have the unemployement rate of 28 countries in Europe including Switzerland. We now try to display all the results in a histogram. We will use a bar plote.
In [9]:
# Fonction to plot the histogram of unemployement rate
# Display the information in colors for datas displaied above
def plot_bars(column_name, number_bars, df, y_ticks_label, diff_colors):
# create subplots and title
fig, ax = plt.subplots()
fig.suptitle('Swiss unemployment ratio compared to the rest of europe', fontsize=20)
fig.set_size_inches(12, 10)
# Define the variables
N = 5
ind = np.arange(number_bars)
width = 0.5
vals = df.head(number_bars)[column_name]
# Setting colors of the regions/countries
df = df.set_index(['Country'])
# Using cm to construct colors of the length of diff_colors
colors = cm.rainbow(np.linspace(0, 1, len(diff_colors)+1))
colorlist = colorlist = [colors[0]] * df.shape[0]
# Map the different regions to location in dataframe to get the right colors assosiated with the diff countries
for c_ind, region in enumerate(diff_colors):
colorlist[df.index.get_loc(region)] = colors[c_ind+1]
# Plot the data
ax.barh(ind, vals, width, color=colorlist, align='center')
# Set labels and ticks
ax.set_xlabel('Unemployment ratio in (%)', fontsize=20)
ax.set_ylabel('Countries/Regions', fontsize=20)
ax.set_yticks(np.arange(number_bars))
ax.set_yticklabels(y_ticks_label, fontsize=12)
# Values read top-to-bottom by default
ax.invert_yaxis()
# Show grid to easier evaluate data
ax.xaxis.grid()
plt.show()
In [10]:
# Include the unemployment combined numbers for the EU nations
df_final = pd.concat([df_final,df_european_unemployement.iloc[[0]]])
df_final.drop_duplicates(inplace=True)
# Sort the dataframe after Uemployment rate so it is easier to do comparison between the countries
df_final = df_final.sort_values(by='2016 Unemployement Rate', ascending=False)
# Making the switzerland and EU stand out with colors in the histogram
countries_diff_color = ['Switzerland', 'European Union (28 countries)']
# Plot the histogram
plot_bars('2016 Unemployement Rate', len(df_final), df_final, df_final.Country, ['Switzerland', 'European Union (28 countries)'])
As shown by the bar plot above, we note that the unemployment rate of Switzerland in 2016 is generally lower than the unemployement rate of the european countries. One can also see the average Unemployment rate of the European Union. Again remember that there are over 20 other countries not included in these numbers. Mostly smaller sovereign principalities like Monaco and Andorra, some slavic countries in eastern europe and some jugoslavian countries.
PS: not all these countries are in the EU like Iceland and Norway that have very low Unemployment ratios.
Go to the amstat website to find a dataset that includes the unemployment rates in Switzerland at a recent date.
HINT Go to the details tab to find the raw data you need. If you do not speak French, German or Italian, think of using free translation services to navigate your way through.
Use this data to build another Choropleth map, this time showing the unemployment rate at the level of swiss cantons. Again, try to make the map as expressive as possible, and comment on the trends you observe.
The Swiss Confederation defines the rates you have just plotted as the number of people looking for a job divided by the size of the active population (scaled by 100). This is surely a valid choice, but as we discussed one could argue for a different categorization.
Copy the map you have just created, but this time don't count in your statistics people who already have a job and are looking for a new one. How do your observations change ? You can repeat this with different choices of categories to see how selecting different metrics can lead to different interpretations of the same data.
We go to : https://www.amstat.ch/v2/index.jsp and download the informations of unemployment rate at the level of swiss cantons and the statistics people who already have a job and are looking for a new one. We try to download the file in German to get the german names of the cantons to match the names in the topojson file.
We get an excel file.
The Cambridge Dictionary defines the active population to be "the people in a country or area who normally work and earn money".
We considered the Unemployement Rate from amstat to be the Number of people without a job divided by the Active Population times 100. The homework definition of Unemployement Rate wasn't clear and did not seem relevant. We sticked to what we found on the amstat description. We Create and Analyze the Jobseeking rate to be the number of People searching for a job(currently with or without a job).
In [11]:
# Import the excel file and call it xls_file
xls_file = pd.ExcelFile('Data/amstat_DE.xlsx')
initial2_df = xls_file.parse()
#display the head of the dataframe
display(initial2_df.head())
From the Dataframe above we now :
(NB Unemployed/Unemployement Rate) *100
(NB Job Seek/Active Population) * 100
In [12]:
#Rename the useful columns
df2 = initial2_df.copy()
df2.rename(columns={'Unnamed: 2':'Unemployement Rate','Unnamed: 3':'Nb Unemployed','Unnamed: 4':'Job Seeker'}, inplace=True)
df2.rename(columns={'2.1 Arbeitslosenquoten':'Canton'},inplace=True)
#Keep the useful culumns
df2 = df2[['Canton','Unemployement Rate','Nb Unemployed','Job Seeker']]
#keep the Unemployement rate in a dataframe
df_swiss_unemployement = df2[df2.Canton=='Total']
# Find the number of the active population for avery row
df2_final = df2[3:29].copy()
df2_final['Active Population'] = round(pd.to_numeric(df2_final['Nb Unemployed']).div(pd.to_numeric(df2_final['Unemployement Rate']))*100)
#then we calculate the number of people looking for a job divided by the size of the active population (scaled by 100).
# We call it JobSeeker Rate
df2_final['JobSeeker Rate'] = round(pd.to_numeric(df2_final['Job Seeker']).div(pd.to_numeric(df2_final['Active Population']))*100,1)
# convert unemployement rate as float
df2_final['Unemployement Rate']=df2_final['Unemployement Rate'].astype(float)
#Create RostiData to use it in the BONUS
RostiData = df2_final.copy()
df2_final = df2_final[['Canton','Unemployement Rate','JobSeeker Rate']]
#We create a column with the ID of every cantons to use it in the graph
df2_final['ID']=['ZH','BE','LU','UR','SZ','OW','NW','GL','ZG','FR','SO','BS','BL','SH','AR','AI','SG','GR','AG','TG','TI','VD','VS','NE','GE','JU']
df2_final = df2_final.reset_index(drop=True)
display(df2_final.head())
Now that we have our data we search for the data with AGE CLASSES. To do so we download another excel file from amstat with the age class. We will get three :
We Do the same work on this datas that above for the whole population. We will fin the Job seeker Rate and Unemployement Rate for every canton and every Age Class(see the explanation for the part above)
In [13]:
# Import the excel file and call it xls_file
xls_file = pd.ExcelFile('Data/amstat_infos.xlsx')
initial_age_df = xls_file.parse()
#Rename the useful columns
swiss_age = initial_age_df.copy()
swiss_age.rename(columns={'Unnamed: 2':'Age Class','Unnamed: 4':'Unemployement Rate','Unnamed: 5':'Nb Unemployed','Unnamed: 6':'Job Seeker'}, inplace=True)
swiss_age.rename(columns={'2.1 Taux de chômage':'Canton'},inplace=True)
#Keep the useful culumns
swiss_age = swiss_age[['Canton','Age Class','Unemployement Rate','Nb Unemployed','Job Seeker']]
#keep the Unemployement rate in a dataframe
df_swiss_unemployement_age = swiss_age[swiss_age.Canton=='Total']
# Find the number of the active population for avery row
swiss_age = swiss_age[3:81].copy()
swiss_age['Active Population'] = round(pd.to_numeric(swiss_age['Nb Unemployed']).div(pd.to_numeric(swiss_age['Unemployement Rate']))*100)
#then we calculate the number of people looking for a job divided by the size of the active population (scaled by 100).
# We call it JobSeeker Rate
swiss_age['JobSeeker Rate'] = round(pd.to_numeric(swiss_age['Job Seeker']).div(pd.to_numeric(swiss_age['Active Population']))*100,1)
swiss_age['Unemployement Rate']=swiss_age['Unemployement Rate'].astype(float)
#Keep the relevant Datas
swiss_age = swiss_age[['Canton','Age Class','Unemployement Rate','JobSeeker Rate']]
swiss_age = swiss_age.reset_index(drop=True)
display(swiss_age.head())
We change the dataframe to have all the datas from one Canton on ONE row
In [14]:
#Copy the datafram swiss_age into a new one to be able to launch this cell again and again.
Swiss_age = swiss_age.copy()
#Iterate over the dataframe
for i in Swiss_age.index :
if (Swiss_age.loc[i, 'Age Class'] == '15-24 ans') :
#update the line
Swiss_age.loc[i,'15-24_Unemp'] = Swiss_age.loc[i,'Unemployement Rate']
Swiss_age.loc[i,'15-24_JobSeek'] = Swiss_age.loc[i,'JobSeeker Rate']
elif (swiss_age.loc[i,'Age Class'] == '25-49 ans'):
#Because we want all the data on the same line, we need to put it at i-1
Swiss_age.loc[i-1,'25-49_Unemp'] = Swiss_age.loc[i,'Unemployement Rate']
Swiss_age.loc[i-1,'25-49_JobSeek'] = Swiss_age.loc[i,'JobSeeker Rate']
else :
#Same case here, we want our data on the first line of the cantons, hence the i-2
Swiss_age.loc[i-2,'50+_Unemp'] = Swiss_age.loc[i,'Unemployement Rate']
Swiss_age.loc[i-2,'50+_JobSeek'] = Swiss_age.loc[i,'JobSeeker Rate']
#We only want to keep every first row for each canton, so we take out the rest
Swiss_age = Swiss_age.take(list(list(range(76))[::3]))
#Getting rid of the rows we do not want.
Swiss_age = Swiss_age.drop(['Age Class', 'Unemployement Rate', 'JobSeeker Rate'], axis=1)
#Because some items were deleting, we must reset the index
Swiss_age = Swiss_age.reset_index(drop=True)
#Get the ID for every Canton
Swiss_age['ID']=['ZH','BE','LU','UR','SZ','OW','NW','GL','ZG','FR','SO','BS','BL','SH','AR','AI','SG','GR','AG','TG','TI','VD','VS','NE','GE','JU']
display(Swiss_age.head())
In [15]:
def create_linear_thresholds(list_of_columns):
"""
create a common linear threshold between the lowest registered value and the highest value across datasets.
Return a list of threshold with linear spacing
"""
# initialize the min and max
temp_min = min(df2_final[list_of_columns[0]])
temp_max = max(df2_final[list_of_columns[0]])
for column in list_of_columns[1:]:
if min(Swiss_age[column]) < temp_min : temp_min = min(Swiss_age[column])
if max(Swiss_age[column]) > temp_max : temp_max = max(Swiss_age[column])
thresholds = list(np.linspace(temp_min-0.001, temp_max+0.001, num=6, endpoint=True))
print(thresholds)
return thresholds
def create_choropleth(s_map, name, data, second_column):
"""
This function create a choropleth map
"""
#load the topojson of the switzerland cantons
swiss_geo_path = 'topojson/ch-cantons.topojson.json'
swiss_json_data = open(swiss_geo_path)
#create the chorpleth layer for the map
s_map.choropleth(geo_data=swiss_json_data,
data=data,name=name,
columns=['ID',second_column],
key_on='feature.id',topojson='objects.cantons',
fill_color='BuPu', fill_opacity=0.8, line_opacity=0.2,
legend_name= name+' in (%)',
threshold_scale=threshold_list)
return s_map
# Create common linear threshold for the jobseeker datasets
# The minimal value will be in df2_final("Unemployment Rate") while the highest will be in the Swiss_age(15-24_JobSeek)
threshold_list = create_linear_thresholds(['Unemployement Rate','15-24_JobSeek','25-49_JobSeek','50+_JobSeek',
'15-24_Unemp','25-49_Unemp'])
In [16]:
#build a blank map of the europe
m_swiss = folium.Map([46.7,8], tiles='cartodbpositron', zoom_start=7.5)
#create different layers of cloropleth for the map m_swiss
m_swiss = create_choropleth(m_swiss, 'TOTAL Job Seeker Rate', df2_final, 'JobSeeker Rate')
m_swiss = create_choropleth(m_swiss, 'Age:15-24 Job Seeker Rate', Swiss_age, '15-24_JobSeek')
m_swiss = create_choropleth(m_swiss, 'Age:25-49 Job Seeker Rate', Swiss_age, '25-49_JobSeek')
m_swiss = create_choropleth(m_swiss, 'Age:50+ Job Seeker Rate', Swiss_age, '50+_JobSeek')
#create a symbol that make you able to turn ON and OFF the different layers
folium.LayerControl().add_to(m_swiss)
#display the map with the different layers
m_swiss
Out[16]:
We now display the map of unemployement Rate
In [17]:
#build a blank map of the europe
m_swiss2 = folium.Map([46.7,8], tiles='cartodbpositron', zoom_start=7.5)
#create different layers of cloropleth for the map m_swiss
m_swiss2 = create_choropleth(m_swiss2, 'TOTAL Unemployement Rate', df2_final, 'Unemployement Rate')
m_swiss2 = create_choropleth(m_swiss2, 'Age:15-24 Unemployement Rate', Swiss_age, '15-24_Unemp')
m_swiss2 = create_choropleth(m_swiss2, 'Age:25-49 Unemployement Rate', Swiss_age, '25-49_Unemp')
m_swiss2 = create_choropleth(m_swiss2, 'Age:50+ Unemployement Rate', Swiss_age, '50+_Unemp')
#create a symbol that make you able to turn ON and OFF the different layers
folium.LayerControl().add_to(m_swiss2)
#display the map with the different layers
m_swiss2
Out[17]:
Use the amstat website again to find a dataset that includes the unemployment rates in Switzerland at recent date, this time making a distinction between Swiss and foreign workers.
The Economic Secretary (SECO) releases a monthly report on the state of the employment market. In the latest report (September 2017), it is noted that there is a discrepancy between the unemployment rates for foreign (5.1%) and Swiss (2.2%) workers.
Show the difference in unemployment rates between the two categories in each canton on a Choropleth map (hint The easy way is to show two separate maps, but can you think of something better ?). Where are the differences most visible ? Why do you think that is ?
Now let's refine the analysis by adding the differences between age groups. As you may have guessed it is nearly impossible to plot so many variables on a map. Make a bar plot, which is a better suited visualization tool for this type of multivariate data.
Once again we go to the amstat website. This time we download the unemployement rate by canton for foreign and swiss workers.
We then create a dataframe with :
diff = ForeignRate-SwissRate
In [18]:
# Import the excel file and call it xls_file
xls_file = pd.ExcelFile('Data/amstat_foreign.xlsx')
initial_foreign_df = xls_file.parse()
#Rename the useful columns
swiss_foreign = initial_foreign_df.copy()
swiss_foreign.rename(columns={'Unnamed: 1':'Nationality','Unnamed: 3':'Unemployement Rate'}, inplace=True)
swiss_foreign.rename(columns={'2.1 Taux de chômage':'Canton'},inplace=True)
#Keep the useful culumns
swiss_foreign = swiss_foreign[['Canton','Nationality','Unemployement Rate']]
#Keep the usful rows
swiss_foreign = swiss_foreign[3:55]
# Create two new columns in the dataframe one for the Foregin Rate and the other for the Swiss Rate
for i in swiss_foreign.index:
if (swiss_foreign.loc[i,'Nationality']=='Etrangers'):
swiss_foreign.loc[i,'Foreign Rate'] = swiss_foreign.loc[i,'Unemployement Rate']
elif (swiss_foreign.loc[i,'Nationality']=='Suisses'):
swiss_foreign.loc[i-1,'Swiss Rate'] = swiss_foreign.loc[i,'Unemployement Rate']
# now we drop one row over 2
swiss_foreign = swiss_foreign[swiss_foreign['Nationality']=='Etrangers']
#keep the relevant columns
swiss_foreign = swiss_foreign[['Canton','Foreign Rate','Swiss Rate']]
#reset the index
swiss_foreign = swiss_foreign.reset_index(drop = True)
#Add a colum with the difference between Foreign rate and swiss rate annd add of the Canton's ID
swiss_foreign['Diff']=pd.to_numeric(swiss_foreign['Foreign Rate'])-pd.to_numeric(swiss_foreign['Swiss Rate'])
swiss_foreign['ID']=['ZH','BE','LU','UR','SZ','OW','NW','GL','ZG','FR','SO','BS','BL','SH','AR','AI','SG','GR','AG','TG','TI','VD','VS','NE','GE','JU']
display(swiss_foreign.head())
Display of the map with color corresponding to the differences between Foreign Unemployment rate and Swiss Unemployement rate for each Canton.
We change the colors of the map this time because we display a differencrial rate. It is not a Real unemployement rate.
In [19]:
#build a blank map of Switzerland
map_diff = folium.Map([46.7,8], tiles='cartodbpositron', zoom_start=7.5)
#load the topojson of the switzerland cantons
swiss_geo_path = 'topojson/ch-cantons.topojson.json'
swiss_json_data = open(swiss_geo_path)
#creation of the choropleth map
map_diff.choropleth(geo_data=swiss_json_data,
data=swiss_foreign,name='Difference between Foreign rate and swiss rate',
columns=['ID','Diff'],
key_on='feature.id',topojson='objects.cantons',
fill_color='BuGn',fill_opacity=0.7, line_opacity=0.2,
legend_name='Difference btw Foreign and swiss Uneployement Rate (%)')
map_diff
Out[19]:
Since the map represents a difference it could have been negative numbers. In that case we would have displayed another color than green, to tend to the red. Shades of red for negative values could have helped to identify the diversity of the gap between foreign and national unemployment
We assumed that we are asked to to analyze the Unemployment Rate for the age classes regardless of the Nationality of the people. We searched thru amstat to get those informations depending on the nationality but we did not find anything.
We use the datas from exercice two with all the age classes for every Canton. We already showed a map above of the unemployement rate for every Age Class.
In [20]:
# We display the dataframe with the unemployement rate for every age class
display(Swiss_age.head())
Now that the DataFrame has handiest shape to be used, we draw the barplot. To do so we take the total unemployment from the second question and add it to Swiss_age in order to be able to sort cantons by unemployment. Doing this will allow us to have a better overview of unemployment with respect to ages.
In order to have a clearer visual, we divide the Swiss_age into 3 dataframes and process them separately. This will allow us to have a clearer display. To implement the barplot we specify some parameters at the begining and then use them throughout the project.
In [21]:
#We extract the total unemployment from the df2, from which we reset the index to avoid mismatches
df2_final = df2_final.reset_index(drop=True)
Swiss_age['Unemployment_tot'] = pd.to_numeric(df2_final['Unemployement Rate'])
Swiss_age = Swiss_age.sort_values('Unemployment_tot', ascending=False)
#We then need to reset the index in order to manipalte the data easily
Swiss_age = Swiss_age.reset_index(drop=True)
#The barplot would have been too big if we plotted it entirely, thus why
#we divided the DataFrame into 3 parts and then display the results.
Age1 = Swiss_age.loc[0:9]
Age2 = Swiss_age.loc[10:15]
Age3 = Swiss_age.loc[16:]
#Defining a function to plot the 3 results.
def Barploting(SubSwissAge):
#Establishing parameters for position and width that we will use all along.
pos = list(range(len(SubSwissAge['Canton'])))
width = 0.25
# Creating the plot
fig, ax = plt.subplots(figsize=(20,10))
#Defining the parameters
plt.bar(pos,
SubSwissAge['15-24_Unemp'],
width,
#Choosing the color:
color='#EE3224',
label=SubSwissAge['Canton'])
#The position has to be further in order not to overlap the first bar
plt.bar([p + width for p in pos],
SubSwissAge['25-49_Unemp'],
width,
# Choosing the color:
color='#F78F1E',
label=SubSwissAge['Canton'])
#The position is 2*width further to avoid overlapping
plt.bar([p + width*2 for p in pos],
SubSwissAge['50+_Unemp'],
width,
# Choosing the color:
color='#FFC222',
# with label the third value in first_name
label=SubSwissAge['Canton'])
#Setting the labels
ax.set_xticklabels(SubSwissAge['Canton'])
#Setting the sizes of the labels
ax.set_xticks([p + 1.5 * width for p in pos])
ax.set_title('Unemployment rate per Canton [%]')
#We must set the scale of y, because we will show 3 graphs and to avoid misleading visualization,
#they should have the same scale
ax.set_ylim(0,7)
# Adding the legend
plt.legend(['Age: 15-24', 'Age: 25-49', 'Age: 50+'], loc='upper left')
plt.show()
return
We plot the results
In [22]:
#Plotting the results
Barploting(Age1)
Barploting(Age2)
Barploting(Age3)
One can notice that there are not too many difference between age classes. Apart from Tessin, other cantons do not seem to show any significant differences in terms of unemployment rate. We can even prove it using Pearson's correlation:
In [23]:
#Defining a name vector to visualize the data
Names = ['15-24', '25-49', '50+']
#Defining correlations
Correlations = [Swiss_age['Unemployment_tot'].corr(Swiss_age['15-24_Unemp']),
Swiss_age['Unemployment_tot'].corr(Swiss_age['25-49_Unemp']),
Swiss_age['Unemployment_tot'].corr(Swiss_age['50+_Unemp'] )]
UnemploymentCorr = pd.DataFrame(columns=[Names, Correlations])
print('Correlations between general unemployment and age categories:')
display(UnemploymentCorr)
There is a strong correlation between the general unemployment in cantons and the value for the youngest part of the active population. However it is normal because the total is aggretated from the ages. Being the least correlated shows that in general on the Swiss labour market, youngters tend to more unemployed than the rest.
Our strategy is to consider cantons that are crossed by the Röstigraben as fully belonging to one side, and then we deal with expection further in our work.
1. Roughly aggregate the data for the Roman and German part
2. Deal with exceptions
3. Display results
To aggreagate the data from all the cantons, we use df2_final from question 2. We cloned the dataframe before removing all the raw data (total active population per canton, total unemployment per canton).
Once we have the aggregate, we treat exceptions. We assume at first that Valais and Freiburg are part of Roman Switzerland, while Berne is considered as entirely German-speaking. We did such division because those region provide only one distric that belongs to the other side, which makes the data processing much simpler.
In [24]:
#As it was extracted in the second excercise, we use the RostiData.
#To extract the data per canton, we must reset the index to "Canton"
RostiData = RostiData.set_index('Canton')
#Putting all the Roman Cantons into a vector
RomanSwiss = ['Jura', 'Genf', 'Neuenburg', 'Wallis', 'Freiburg', 'Waadt' ]
#Getting the total active population and unemployed number, extract the number from RostiData
RomanUnemployed = 0
RomanActive = 0
for i in RomanSwiss :
RomanUnemployed += RostiData.loc[i, 'Nb Unemployed']
RomanActive += RostiData.loc[i, 'Active Population']
print(RomanUnemployed)
print(RomanActive)
#We can then compute the values for the German-speaking part, getting of the Tessin because it is not German-speaking
GermanUnemployed = pd.to_numeric(RostiData['Nb Unemployed'].sum()) - RomanUnemployed - RostiData.loc['Tessin', 'Nb Unemployed']
print(GermanUnemployed)
GermanActive = RostiData['Active Population'].sum() - RomanActive - RostiData.loc['Tessin', 'Active Population']
print(GermanActive)
Let's deal with the trickiest part, the cantons that are crossed by the Röstigraben.
We have 3 cases to handle, because 3 cantons are crossed by the Röstigraben: Valais, Fribourg and Bern. One can notice that the subregions are not crossed by the Röstigraben, which is a property that we will take advantage of.
For Fribourg, the only German-speaking subregion is Singine. From the following document on the administration's website:
http://www.fr.ch/spe/files/pdf95/infobulle_09_2017.pdf
One can find that the number of unemployed is 335, and that the unemployement rate is 1.4 %
The active population is expressed by round(335/0.014) = 23929 persons.
In [25]:
#Defining variables:
Singine_total_actpop = 23929
Singine_total_unemployed = 335
#Substracting this to the Roman part
RomanActive = RomanActive - Singine_total_actpop
RomanUnemployed = RomanUnemployed - Singine_total_unemployed
#Adding it to the German part
GermanActive = GermanActive + Singine_total_actpop
GermanUnemployed = GermanUnemployed + Singine_total_unemployed
In [26]:
HValais_total_actpop = 46000
HValais_total_unemployed = 368
#Substracting this to the Roman part
RomanActive = RomanActive - HValais_total_actpop
RomanUnemployed = RomanUnemployed - HValais_total_unemployed
#Adding it to the German part
GermanActive = GermanActive + HValais_total_actpop
GermanUnemployed = GermanUnemployed + HValais_total_unemployed
For Bern, only "Berner Jura" belongs to the French speaking part One can find the data of this "Arrondissement" from the following website: http://www.vol.be.ch/vol/fr/index/arbeit/arbeitslosigkeit.assetref/dam/documents/portal/Medienmitteilungen/de/2017/10/2017-10-10-vk-tabelle-september.pdf
One can find that the number of unemployed is 931, and that the unemployement rate is 3.4 % The active population is expressed by round(931/0.034) = 27382. persons.
In [27]:
JuraBern_total_actpop = 27382
JuraBern_total_unemployed = 931
#Substracting the data to the German part
GermanActive = GermanActive - JuraBern_total_actpop
GermanUnemployed = GermanUnemployed - JuraBern_total_unemployed
#Adding it to the Roman part
RomanActive = RomanActive + JuraBern_total_actpop
RomanUnemployed = RomanUnemployed + JuraBern_total_unemployed
In [28]:
#Computing the rates
RomanRate = (RomanUnemployed/RomanActive)*100
GermanRate = (GermanUnemployed/GermanActive)*100
print(RomanRate)
print(GermanRate)
We could not find any useful and efficient way to use JSON files to display the unemployement along the Röstigraben. One option we had was to roughly draw a line and then apply our data to it. However it seemed to imprecise to us. We chose to find a real map of the Röstigraben on the internet and we added our results manually. The website is the following:
http://www.kleinreport.ch/news/rostigraben-soll-zum-kulturerbe-werden-79466/
We are completely aware that this is by far not the best way to do it, especially after using those data visualization tools. However it remains the most accurate and the final visual is useful to understand the anwser to this question.
In [29]:
display(Image(filename='Rosti_Map.png'))
In [ ]: